The Logic of Regressions
Recall that the regression coefficient is calculated from the
covariance of \(X\) and \(Y\). If the values of the observations in a
sample differ a lot, the variance is high; if the values are relatively
similar, the variance is low.
For instance, take the numbers 2, 9 and 67. We calculate the
variance by summing the squared differences between each number and the
mean, and dividing by the number of observations (minus one if the numbers
are a sample rather than a population):
\[
\sigma^2=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1}
\]
numbers <- c(2,9,67)
#to create a vector containing the numbers 2, 9 and 67
numbers
num_mean <- mean(numbers)
#to compute the mean value for all the observations in the vector numbers
num_mean
num_var_sample <- 1/(3-1) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the sample variance
num_var_sample
num_var_pop <- 1/(3) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the population variance
num_var_pop
# More simply:
var(numbers) # note var() automatically makes a small-sample adjustment
The variance is high (849 if the numbers represent the whole
population, 1273 if they are a sample).
Recall that the variance represents the average squared distance
between the observations and the mean. The standard deviation, accordingly,
is the square root of this quantity: about 36 for our sample.
sd(numbers)
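The two are linked directly: taking the square root of var() reproduces sd(). A quick self-contained check:

```r
numbers <- c(2, 9, 67)
# the standard deviation is the square root of the (sample) variance
sqrt(var(numbers))  # same value that sd(numbers) returns
```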
To summarise the relationship between two variables, we can further
calculate the correlation. Correlations are a
standardised, unit-free way of measuring the relationship
between two variables. They represent the degree to which one variable
is linearly associated with another; however, they do not reveal the
magnitude of that association.
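Concretely, the Pearson correlation that cor() computes is the covariance rescaled by both standard deviations, which is what makes it unit-free. A small sketch with made-up numbers:

```r
# made-up example values
x <- c(2, 9, 67)
y <- c(1, 4, 50)
# correlation = covariance divided by the product of the standard deviations
cov(x, y) / (sd(x) * sd(y))  # identical to cor(x, y)
```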
cor(brexit_data$age_mean, brexit_data$Percent_Leave, use = "complete")
#the correlation is 0.38 on a scale from -1 to 1.
Since the correlation is positive, we know that as mean age increases
so, too, does the percentage of Leave voters in the
regional unit.
But how much does a one year increase in age add to the
Leave vote in general terms? Let’s use regression
to find out!
In linear regression analysis, we estimate the coefficient \(\hat{\beta}\) by dividing the covariance of
\(X\) and \(Y\) (the extent to which \(X\) and \(Y\) move together) by the variance of \(X\) (how much variation there is in \(X\), simply speaking). In other words:
To what extent do the different values that \(X\) takes explain the relationship between
\(X\) and \(Y\)?
\[ \hat{\beta}=\frac{\operatorname{COV}(X,
Y)}{\operatorname{VAR}(X)} \]
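Before applying this to the Brexit data, here is a minimal sketch with made-up numbers showing that cov(x, y)/var(x) reproduces the slope that lm() estimates:

```r
# made-up example data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cov(x, y) / var(x)        # manual slope estimate
coef(lm(y ~ x))[["x"]]    # lm()'s slope - the same number
```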
To calculate the coefficient manually, we start by calculating the
covariance of age and Leave voting:
age_leave_cov <- cov(brexit_data$Percent_Leave, brexit_data$age_mean, use = "complete")
#note the extra "use" argument at the end:
#missing observations should be ignored, otherwise the result would be NA.
age_leave_cov
Next we can calculate the variance in our \(X\)-variable, i.e. age:
age_var <- var(brexit_data$age_mean, use = "complete")
age_var
Finally, we can divide the covariance of \(X\) and \(Y\) by the variance of \(X\) to estimate the regression coefficient
\(\beta\):
beta <- age_leave_cov/age_var
beta
We see that - based on our linear estimation - a one year increase in
the mean age of the electoral district is associated with
an increase in the Leave vote share of the area by
1.288 %-points on average.
Linear Regression in R
Doing this calculation manually, however, is time-consuming.
Fortunately, R can easily calculate the regression
coefficients for us (including the intercept value, too, which is
automatically added to the model):
mod1 <- lm(brexit_data$Percent_Leave ~ brexit_data$age_mean)
#note that the results are stored in the working environment.
#Printing mod1 directly is not the most informative way to inspect the regression - feel free to try it and compare.
#Usually, we print the output using the summary command:
summary(mod1)
You can see that the coefficient value for age_mean is
the same as the manual value (\(\hat{\beta}\)) we computed above.
Next, we might wonder how much of the variation in \(Y\) is explained by the variation in \(X\). We can find out by asking for the
R-squared, which broadly indicates what portion of the
variation is being explained by the regression model:
summary(mod1)$r.squared
We get an R-squared of 0.1463, which means that about
15 percent of the variation in the Leave vote
is explained by age. Intuitively, R-squared tells us how much better
our model predicts \(Y\) than simply using the mean (in this case the
mean Leave vote) for every observation. Here, we conclude that
age does account for a meaningful part of the observed
variation: constituencies with a higher mean age are more
likely to vote Leave. So, we get a better
prediction of how a constituency votes by knowing the mean age of its voters
than by relying on the mean Leave vote alone.
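This "better than the mean" interpretation can be made explicit: R-squared equals one minus the ratio of the model's residual sum of squares to the total sum of squares around the mean. A self-contained sketch with made-up numbers:

```r
# made-up example data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)   # variation the model fails to explain
tss <- sum((y - mean(y))^2)    # variation around the mean
1 - rss / tss                  # matches summary(fit)$r.squared
```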
Moreover, when we fit our estimated linear regression line, we can
predict the Leave vote share for mean ages which are
not included in our data.
plot(brexit_data$Percent_Leave ~ brexit_data$age_mean, xlim =c(30,50),
ylim =c(3,80), ylab = "Leave vote share", xlab = "Mean age", pch=16, col="blue")
abline(mod1)
# Recall why we specify the arguments in the command above: We want to create a figure that is as insightful, clear and parsimonious as possible!
#xlim = what are the limits of the x-axis? Mean ages are between 30 and 50.
#we can find this out by
min(brexit_data$age_mean, na.rm=TRUE)
#alternative code: min(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
max(brexit_data$age_mean, na.rm=TRUE)
#alternative code: max(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
#ylim = what are the limits of the y-axis?
min(brexit_data$Percent_Leave)
max(brexit_data$Percent_Leave) #no missing values here
#Leave vote share between 4 and 76
# xlab and ylab for naming the y and x axes
# pch and col just to make the dots prettier!
#if you just want to get a simple graph as a first idea, just run
plot(brexit_data$Percent_Leave ~ brexit_data$age_mean)
abline(mod1)
At the constituency mean age of 40.23808, we
predict a 54 percent vote share for Leave. You can
also calculate this manually without the graph by using the equation
\[\hat{Y}=\hat{\alpha}+\hat{\beta} X\] Based on
our simple bivariate model, our estimated value of the
Leave vote will be
\[2.611 + (1.288 \times age_{mean}) = 2.611 +
(1.288 \times 40.23808) = 54.44 \] which is very close to what we got by
looking at the graph.
Manual calculation:
#estimated % Leave share = 2.611 + 1.288 * mean age
mean(brexit_data$age_mean,na.rm = TRUE)
beta # as we calculated before
# Note that you can call the estimated intercept and regression slope from the OLS model we saved earlier
mod1$coefficients[1]+mod1$coefficients[2]*mean(brexit_data$age_mean,na.rm = TRUE)
#estimated share at another value of age:
mod1$coefficients[1]+mod1$coefficients[2]*45
# Double check
2.611+1.288*40.23808
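As an aside, if you fit the model with a data argument, R's predict() function computes such fitted values for any new x-value. A sketch with made-up numbers (with the real data you would fit lm(Percent_Leave ~ age_mean, data = brexit_data)):

```r
# made-up miniature version of the Brexit data
df <- data.frame(age_mean      = c(35, 38, 41, 44, 47),
                 Percent_Leave = c(45, 49, 55, 58, 63))
fit <- lm(Percent_Leave ~ age_mean, data = df)
# evaluate the fitted regression line at a mean age of 45
predict(fit, newdata = data.frame(age_mean = 45))
```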
We now turn to a different question: Did immigration levels
correlate with the Leave vote - and, if so, can we predict
the Leave vote share based on the level of immigration in a
given constituency?
Let’s try to answer this by regressing the Leave vote
share on the percentage of UK-born residents in each constituency.
UKborn <- lm(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent)
summary(UKborn) #to print out the result
We get a coefficient of 0.58210 for
Birth_UK_percent. This means that a one percentage-point increase in
the proportion of UK-born residents is associated with an average 0.58
%-point rise in the Leave vote share. The more UK-born
residents live in the electoral district, the higher the
Leave vote!
Let’s also look at the correlation between these two variables:
cor(brexit_data$Percent_Leave, brexit_data$Birth_UK_percent, use="complete")
How do you interpret the regression coefficient? Is
it a better predictor than age? Let’s plot this and include our line of
best fit.
plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, pch=16, col="blue")
abline(UKborn)
Now, let’s try to look at this same relationship by creating a
variable for the share of foreign-born residents in each constituency,
before regressing the Leave vote on this newly created
variable. [Can you think of another way to create this
variable?]
brexit_data$foreignborn_percentage <- brexit_data$Birth_other_EU_percent +
brexit_data$Birth_Africa_percent +
brexit_data$Birth_MidEast_Asia_percent +
brexit_data$Birth_Americas_Carrib_percent +
brexit_data$Birth_Antarctica_Oceania_Other_percent
#foreignborn_percentage is the new variable
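One possible answer to the bracketed question: if the birthplace categories are exhaustive (i.e. the shares sum to 100 percent - an assumption worth checking in the data), the foreign-born share is simply 100 minus the UK-born share. A toy illustration:

```r
# toy values; with the real data this would be
# 100 - brexit_data$Birth_UK_percent,
# assuming the birthplace shares sum to exactly 100
Birth_UK_percent <- c(96, 88, 70)
foreignborn_alt <- 100 - Birth_UK_percent
foreignborn_alt  # 4 12 30
```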
cor(brexit_data$foreignborn_percentage,
brexit_data$Percent_Leave, use="complete") #this time negative
foreigners <- lm(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage)
summary(foreigners)
#foreigners stores the results of this regression, to view just type "foreigners"
plot(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage, pch=16, col="blue")
abline(foreigners)
A concern we may have at this point is whether our results are
possibly driven by an outlier. We can get a better idea about
what is influencing our estimates by asking R to label
our observations.
plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, type="n")
text(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, labels=brexit_data$Region, cex=0.5)
The labels are hard to read, but we can see that the areas at the lower end
of both the Leave vote share and the share of
UK-born residents are located in London. Let's subset the
data to exclude London, and see if our estimates change.
#how many districts are in the London region?
length(brexit_data$Region[brexit_data$Region=="London"])
#we subset the data to exclude the 33 areas in London
withoutLondon <- subset(brexit_data, Region != "London")
dim(withoutLondon)
#The new number of observations is 349.
#This comes from subtracting London's 33 areas from the original total of 382.
#and now we re-run the regression
immigWL <- lm(withoutLondon$Percent_Leave ~ withoutLondon$Birth_UK_percent)
summary(immigWL) #what happened to our coefficient?
What do you make of this estimate? How does it relate to your
previous model? Provide a substantive interpretation.
Linear regression models often provide the most important insights in
data analysis - it’s not by chance that they have become the gold
standard of social science analysis. Let’s present our regression result
in a nice way - we can use the stargazer package to create
a decent-looking regression table.
install.packages("stargazer") #note the quotation marks
library(stargazer)
# For example (yvar and xvar are placeholders - replace them with variables from your data):
fit_example <- lm(yvar ~ xvar)
stargazer(fit_example, type="text")
stargazer(mod1,immigWL, type="text")
# Fully annotated/presentable version
stargazer(mod1,immigWL,
column.labels=c("Leave (%)"),
column.separate = c(2), # telling R to use the label for the first two columns
dep.var.labels.include = FALSE,
covariate.labels=c("Age (Mean)", "Born in UK (%)"),
type="text")
Alternatively, you can set type="latex" to make
stargazer produce LaTeX code. You can then paste this code into whatever
programme you use for LaTeX on your machine, or into Overleaf - the latter
comes in handy for producing professional-looking tables even if you
don't usually use LaTeX.